ABSTRACT

The majority of bacterial genes are located on the 
leading strand, and the percentage of such genes 
has a large variation across different bacteria. 
Although some explanations have been proposed, 
these are at most partial explanations as they 
cover only small percentages of the genes and do 
not even consider the ones biased toward the 
lagging strand. We have carried out a computational 
study on 725 bacterial genomes, aiming to elucidate 
other factors that may have influenced the strand 
location of genes in a bacterium. Our analyses 
suggest that (i) genes of some functional categories 
such as ribosome have higher preferences to be on 
the leading strands; (ii) genes of some functional 
categories such as transcription factor have higher 
preferences on the lagging strands; (iii) there is a 
balancing force that tends to keep genes from all 
moving to the leading and more efficient strand 
and (iv) the percentage of leading-strand genes in
a bacterium can be accurately explained based
on the numbers of genes in the functional 
categories outlined in (i) and (ii), genome size and 
gene density, indicating that these numbers impli- 
citly contain the information about the percentage 
of genes on the leading versus lagging strand in a 
genome. 

INTRODUCTION 

It has been observed that the majority of bacterial genes 
tend to be located on the leading strand in a genome, and 
the percentage of such genes has a large variation across 
different bacteria, ranging from ~45% to ~90% (1,2). 



A number of studies have been carried out aiming to 
provide explanations for such observations. A key factor 
considered in these studies is the different mechanisms 
used by bacterial cells in replication of the leading and 
the lagging strands when cell replication and transcription 
occur simultaneously (3,4). Specifically, during chromo- 
somal replication, deoxyribonucleic acid (DNA) and ribo- 
nucleic acid (RNA) polymerases move in the same 
direction on the leading strand but in opposite directions 
on the lagging strand, creating the possibility of head-on
collisions between the two polymerases during transcrip- 
tion of some genes on the lagging strand, hence making 
the lagging strand the less efficient one between the two 
(1,4). In an earlier study, Brewer (3) suggested that bac- 
terial cells may be under a selection pressure to have 
highly expressed genes reside on the leading strand. 
Rocha and Danchin (5,6) recently argued that it is really 
the essentiality instead of the needed expression levels of 
genes that may have driven certain genes to the leading 
strand. Although this interpretation seems to be correct, it 
provides only a partial answer as essential genes account 
for only a small portion of the whole gene set encoded in a
bacterial genome, e.g. ~10% in Escherichia coli (7,8) and
~10% in Bacillus subtilis (9). Price et al. (10) observed that
longer operons tend to be on the leading strand and sug- 
gested that there may be a selection pressure to have such 
an arrangement to avoid interruptions during transcription
of such operons. Furthermore, Rocha (6,11) observed
that the presence/absence of the DNA polymerase PolC 
in a genome is highly correlated with bacterial genomes 
having at least 70% of their genes on the leading strand or 
not. Hu et al. (12) proposed that replication-associated 
purine asymmetry may also contribute to the strand bias 
in a genome. In addition, Lin et al. (13) found that the
essential genes on the leading strand are enriched in only a
few sub-categories of clusters of orthologous groups
(14). Although this analysis provided useful insights into the
functional preference of genes for the leading and lagging
strands, a larger analysis involving more genes and organisms
is needed to ensure the generality of the observation.
More importantly, the general issues of why the majority 
of bacterial genes tend to be located on the leading strands 
and why the percentage of leading strand genes has such a 
large variation across different organisms remain largely 
unanswered. 

We present in this study a computational analysis of all 
the sequenced bacterial genomes aiming to provide a more 
general explanation to the above two observations. Our 
key findings are (i) genes of different functional categories 
have different levels of tendency to be on the leading
strand; (ii) genes of some functional categories such as 
transcription factor have higher preferences to be on the 
lagging strands; (iii) there is at least one balancing force 
that keeps genes from all moving to the leading strand 
during evolution, i.e. a more balanced genome facilitates 
a higher gene density in a genome and (iv) the percentage 
of leading-strand genes for a bacterium can be accurately 
explained in terms of genes in some functional categories 
outlined in (i) and (ii), genome size and gene density. On the
basis of these findings, we believe that the percentage of 
genes on the leading versus lagging strand in a genome is 
the result of two sets of balancing forces, one that tends to 
drive genes of certain functional categories to the leading 
strands to make the bacteria more efficient in their re- 
sponses to environmental changes and one that tends to 
keep the genome as compact as possible to stay energetic- 
ally efficient when replicating and maintaining the 
genome. 

MATERIALS AND METHODS 

Data 

The 725 bacterial genome sequences along with their 
predicted genes and functional annotations were retrieved 
from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/ 
Bacteria/) as of 11 December 2010. The gene ontology 
(GO) annotations for these genomes were from the 
GOA Proteome Sets (v52) (15), and the GOslim
definitions were downloaded from the Gene Ontology
site (http://www.geneontology.org/GO_slims/goslim_generic.obo)
(16). The microarray data for E. coli
were downloaded from the M3D web site (http://m3d.bu.edu) (17).

High-level functional annotations of genes 

GO (16) was used to define functional categories of gene 
products. Based on the GO annotation and GO hierarchy 
information, the Perl script map2slim
(http://search.cpan.org/~cmungall/go-perl/scripts/map2slim)
was applied to the bacterial genomes to assign GOslim-based
functional categories.

Determination of genes on leading and lagging strands 

To determine whether a gene is on the leading versus 
lagging strand of a genome, the origin and the terminus 
of replication are needed. The origin of replication for
each of the 725 bacterial genomes was retrieved from
the DoriC database (18), which has been widely used
in comparative genomics analysis (19-21) and in
origin prediction for newly sequenced genomes (22-28).
The terminus of replication is then estimated as the
location of the origin of replication plus half of the
chromosome length. With these two positions, the leading
and lagging strands are determined for each half of the
chromosome according to the well-known fact that the
leading strand always has more genes than the lagging
strand does (11). For each bacterium, only the major
chromosome is considered, and plasmids are excluded 
in this study. 
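The strand assignment described above can be sketched in a few lines. This is a minimal illustration only: it uses the conventional co-orientation rule (a gene is on the leading strand when it is transcribed in the same direction as the replication fork in its half of the chromosome), which is not necessarily the exact procedure used in this study, and it assumes 0-based coordinates on a circular chromosome. The function names `terminus_position`, `is_leading` and `leading_fraction` are hypothetical.

```python
def terminus_position(origin: int, chrom_len: int) -> int:
    """Terminus taken as the origin plus half the chromosome length (circular)."""
    return (origin + chrom_len // 2) % chrom_len


def is_leading(gene_start: int, strand: str, origin: int, chrom_len: int) -> bool:
    """Return True if the gene is on the leading strand, False if lagging.

    Assumption: the gene is assigned to a chromosome half by its start
    coordinate, and strand is '+' or '-'.
    """
    # Distance travelled clockwise from the origin to the gene start.
    offset = (gene_start - origin) % chrom_len
    in_first_half = offset < chrom_len // 2
    # First half (origin -> terminus, clockwise): '+' genes are co-directional
    # with the fork. Second half: '-' genes are co-directional.
    return (strand == "+") == in_first_half


def leading_fraction(genes, origin: int, chrom_len: int) -> float:
    """Fraction of genes on the leading strand; genes = [(start, strand), ...]."""
    flags = [is_leading(start, strand, origin, chrom_len) for start, strand in genes]
    return sum(flags) / len(flags)
```

For example, with an origin at position 100 on a 1000-bp toy chromosome, the terminus falls at position 600, and a '+' gene starting at 200 is classified as leading while a '+' gene starting at 50 is classified as lagging.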

Preference of functional categories on different strands 

Given a GO functional category, an index x is calculated
using the following formula:

$$x = \frac{n_0}{n_0 + n_1}$$

where $n_0$ is the number of leading-strand genes of this
functional category and $n_1$ is the number of lagging-strand
genes of this category. x is calculated for each of the GOslim
functional categories on the leading strand
for all 725 genomes, so that for each category, there is
a data set (A) of 725 values. In addition, the overall
percentage of leading-strand genes (data set B) is 
obtained for each of the 725 genomes as well. For each 
GOslim functional category, a Wilcoxon rank sum test 
was performed to test whether the data sets A and B are 
from two distinct distributions. We also used a similar
procedure to assess the preference of functional categories
on the lagging strand. All the statistical analyses were
conducted using the R statistical language (http://www.r-project.org).
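The comparison between data sets A and B can be illustrated with a small, standard-library-only sketch. It assumes that the index x is the fraction of a category's genes that lie on the leading strand, and it uses the normal approximation to the Wilcoxon rank sum test without tie or continuity correction, so its P values may differ slightly from those returned by R's `wilcox.test`. All function names here are hypothetical.

```python
import math


def leading_index(n0: int, n1: int) -> float:
    """x = n0 / (n0 + n1): fraction of a category's genes on the leading strand."""
    return n0 / (n0 + n1)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Returns (z, p). No tie or continuity correction is applied.
    """
    n_a, n_b = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n_a])  # rank sum of sample a
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p
```

In this sketch, `a` would hold the 725 per-genome values of x for one GOslim category and `b` the 725 overall leading-strand fractions; a positive z with a small p would indicate a preference for the leading strand.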

Prediction of the percentage of leading-strand 
genes in a genome 

Network training 

A neural network model, with one hidden layer of 10
nodes, is used to predict the percentage of leading-strand
genes in a genome, first using all 57 inputs and then
using smaller selected subsets of 30, 25, 20, 15, 10 and 5
inputs. To reduce the possibility of over-fitting,
we used an early stopping technique, which
divides the data into three subsets: a training set used for
computing the gradient and updating the network weights
and biases, a validation set used for monitoring the training
process by its error rate and a testing set used for assessing
the neural network performance independently. When the
network starts to over-fit the data, the error on the validation
set begins to rise, and hence, the training process is
stopped early. We used the default setting in the
MATLAB Neural Network Toolbox, which randomly
divides the data into the three subsets: 507
(70%) for the training set, 109 (15%) for the validation set and
109 (15%) for the testing set. The performance of a neural
network is measured by mean squared error (MSE) and
Pearson correlation score (R). The trained neural network
can be downloaded from
http://csbl.bmb.uga.edu/~xizeng/research/gene_strand_bias/.
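The data division, stopping rule and performance measures described above can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB toolbox code used in the study. The `patience` parameter (the number of consecutive epochs without validation improvement before stopping) and its default value are assumptions, as are all function names.

```python
import math
import random


def split_indices(n: int, val_frac: float = 0.15, test_frac: float = 0.15, seed: int = 0):
    """Randomly partition range(n) into training, validation and testing index lists."""
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    n_train = n - n_val - n_test
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def early_stop_epoch(val_errors, patience: int = 6):
    """Epoch index at which training would stop, or None if it never does.

    Training stops once the validation error has failed to improve on its
    best value for `patience` consecutive epochs (hypothetical setting).
    """
    best = float("inf")
    bad = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, bad = err, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return None


def mse(y_true, y_pred) -> float:
    """Mean squared error between desired and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Applied to 725 genomes, `split_indices(725)` yields subsets of 507, 109 and 109 genomes, matching the sizes reported above.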

Variable selection 

Out of the initial set of 57 input variables, we have conducted
a variable selection process based on the idea of
mean impact value (MIV) (29). Based on the ranks of the
MIVs of the input variables, input variables with insignificant
MIV are eliminated from the neural network
model. MIVs are calculated as follows: for each input
variable, increase and decrease its value by 10% for
all samples and obtain two outputs. Then subtract one from
the other to obtain the impact value (IV)
of the output due to the changes of the input
variable values. Then the MIV is obtained by averaging
the IVs across all trained networks:

$$\mathrm{MIV} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{IV}_i$$

where n is the number of times the network was trained.
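The MIV calculation can be sketched as below. This is a minimal illustration under two assumptions: each trained network is represented by a plain Python callable that maps one input row to one output value, and the IV of a variable for a given network is the output difference averaged over all samples. The function names are hypothetical.

```python
def impact_value(model, samples, var_index: int, delta: float = 0.10) -> float:
    """Mean output change when one input is raised versus lowered by `delta`."""
    total = 0.0
    for row in samples:
        up = list(row)
        down = list(row)
        up[var_index] = row[var_index] * (1 + delta)    # input increased by 10%
        down[var_index] = row[var_index] * (1 - delta)  # input decreased by 10%
        total += model(up) - model(down)
    return total / len(samples)


def mean_impact_values(models, samples):
    """MIV per input variable: the IV averaged over all trained models."""
    n_vars = len(samples[0])
    return [
        sum(impact_value(m, samples, j) for m in models) / len(models)
        for j in range(n_vars)
    ]
```

Variables would then be ranked by the absolute value of their MIV, and those with the smallest magnitudes dropped from the input set.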



RESULTS 

Characteristics of genes on leading strands 

We have analyzed 725 sequenced eubacterial genomes for
which origins of replication and GO-based annotations
are available in terms of the strand biases of their
protein-encoding genes; archaea were excluded from
our analysis as they may have multiple origins of replication.
Figure 1A shows the percentage distribution of
leading-strand genes across all the 725 genomes, ranging
from ~45% to ~90%. This observation extends a previous
observation made based on a few bacterial genomes.
Across these 725 bacteria, the percentage of leading-strand
genes does not show any correlation with genome sizes in
terms of gene numbers (Supplementary Figure S1),
whereas different phyla have substantially different
averaged percentages of leading-strand genes
(Supplementary Figure S2), which is consistent with a
previous finding made on a smaller group of bacterial
genomes (11,30).

We have also examined the relationship between leading-strand
bias and the growth rate for 104 of the 725 bacterial
genomes, for which the doubling-time information is
available (31). We found that bacterial genomes with
high leading-strand bias (>70%) tend to have higher
growth rates than those with low leading-strand bias
(<70%), as measured by the Wilcoxon rank sum test with
P value 1.9 x 10""*, as shown in Figure 1B. This observation
changes the previous conclusion that fast-growing
bacteria have similar leading-strand bias to that of the 
slow-growing bacteria (11), which was made based on a 
substantially smaller number of genomes with known 
growth rates. 

Functional categories whose genes have different 
preference to different strands 

We have examined whether genes of different functional
categories may have different levels of preference to be on
the leading versus the lagging strands across all bacteria.
To do this, we checked 55 of the 127 GOslim functional
categories (16) that have available gene assignments in at
least 36 (5%) of the 725 genomes and have a median number
of genes between 5 and 500 (categories
with >500 genes will be too general for our study) across
all the genomes under consideration. For each of the 55
functional categories, we consider a functional category to
prefer a strand if genes in this category have a higher
percentage on the strand than the average percentage of all
genes on the strand across all genomes. The Wilcoxon rank sum
test is used to assess the statistical significance of an
observed preference, measured using a P value. We
found that 32 of the 55 categories prefer the leading
strand with P value < 0.01, including genes related to
ribosome, structural molecule activity, translation,
RNA binding and cell cycle, with the detailed information
presented in Table 1; and 11 categories prefer the lagging
strand with P value < 0.01, including DNA-binding transcription
factor activity, signal transducer activity and
regulation of biological process, with details in Table 2.
On average, 52% of the genes encoded in a bacterial
genome are covered by the 43 (32 + 11) functional
categories, and the detailed distribution of percentage
across different bacterial genomes is given in
Supplementary Table S1. Notably, transcription factor

Figure 1. General characteristics for leading-strand genes. (A) Distribution of the number of bacteria with a specific percentage of genes on the
leading strands; and (B) distribution of the percentages of leading-strand genes versus cell growth rate in the 104 bacterial genomes with growth rate
data available.

Table 1. Preference of GOslim categories toward leading strands across 725 bacterial genomes

GO branch  GO category                                                   Preference (P value)
MF         GO:0005198 structural molecule activity                       7.28E-156
MF         GO:0003723 RNA binding                                        1.15E-107
MF         GO:0008135 translation factor activity                        4.67E-78
MF         GO:0005515 protein binding                                    1.37E-32
MF         GO:0003774 motor activity                                     2.05E-24
MF         GO:0000166 nucleotide binding                                 5.93E-14
MF         GO:0003676 nucleic acid binding                               1.04E-13
MF         GO:0030234 enzyme regulator activity                          1.97E-08
MF         GO:0016740 transferase activity                               9.62E-05
BP         GO:0006412 translation                                        6.77E-118
BP         GO:0007049 cell cycle                                         8.66E-72
BP         GO:0019538 protein metabolic process                          7.64E-56
BP         GO:0015031 protein transport                                  7.60E-34
BP         GO:0016043 cellular component organization                    1.81E-30
BP         GO:0009605 response to external stimulus                      1.89E-20
BP         GO:0007154 cell communication                                 2.51E-18
BP         GO:0006091 generation of precursor metabolites and energy     1.11E-17
BP         GO:0005975 carbohydrate metabolic process                     7.17E-16
BP         GO:0019748 secondary metabolic process                        3.10E-15
BP         GO:0006629 lipid metabolic process                            1.37E-11
BP         GO:0009056 catabolic process                                  1.51E-06
BP         GO:0006519 cellular amino acid and derivative metabolic process  2.41E-06
BP         GO:0006811 ion transport                                      5.37E-05
BP         GO:0006950 response to stress                                 3.26E-03
CC         GO:0005840 ribosome                                           1.7/ r.- 1 Jo
CC         GO:0043226 organelle                                          5.17E-116
CC         GO:0005737 cytoplasm                                          6.29E-56
CC         GO:0005622 intracellular                                      6.67E-41
CC         GO:0005694 chromosome                                         1.11E-26
CC         GO:0043234 protein complex                                    6.98E-26
CC         GO:0005618 cell wall                                          5.31E-11
CC         GO:0005886 plasma membrane                                    1.42E-06

The first column represents the three major GO categories: molecular
function (MF), cellular component (CC) and biological process (BP).

Table 2. Preference of GOslim categories toward lagging strands across 725 bacterial genomes

GO branch  GO category                                                   Preference (P value)
MF         GO:0003700 sequence specific DNA binding transcription factor activity  3.09E-34
MF         GO:0016209 antioxidant activity                               3.31E-11
MF         GO:0003677 DNA binding                                        3.41E-11
MF         GO:0004871 signal transducer activity                         2.10E-06
MF         GO:0004672 protein kinase activity                            9.34E-06
MF         GO:0008233 peptidase activity                                 1.08E-05
BP         GO:0050789 regulation of biological process                   3.20E-15
BP         GO:0019725 cellular homeostasis                               3.08E-11
BP         GO:0006350 transcription                                      2.94E-09
BP         GO:0006464 protein modification process                       1.06E-06
BP         GO:0007165 signal transduction                                6.31E-05

The first column represents the three major GO categories: molecular
function (MF), cellular component (CC) and biological process (BP).



activity (GO:0003700) shows a strong preference for the
lagging strand. To confirm this, we have examined the set
of all 271 annotated transcription factors in E. coli 
from the RegulonDB database (32) and found the same 



strand preference with P value 5.8 x 10~^ (Supplementary 
Table S2). One possible explanation is that transcription 
factors, particularly non-global transcription factors, are 
known to have low expression levels (33) and, hence, rep- 
resent the last group of genes to move to the leading 
strand during evolution. 

To check whether our analysis covers the observation 
that essential genes tend to be on the leading strands made 
by Rocha and Danchin (5), we created an artificial functional
category 'essential genes' and applied our analysis
to all the essential genes in 13 bacterial genomes in the
DEG database (34), which has the most comprehensive
annotated essential gene list. As expected, this
category has a significant P value for preferring to be on
the leading strand (Supplementary Figure S3), indicating 
that our explanation covers the observation made by 
Rocha and Danchin (5). 

A balancing force: strand bias versus gene density 

Our analysis suggests that there might be a selection 
pressure for a bacterium to have a more compact 
genome (i.e. a shorter genome without losing genes), par- 
ticularly in a complex environment. To check this hypoth- 
esis, we have examined the percentages of coding regions 
in two groups of bacteria, one containing all bacteria
with at least 70% of the genes on the leading strands
and one containing all the other bacteria among the 725, and
checked their relationship with the lifestyles of the
bacteria. Our analysis revealed that (i) the bacteria in
the second group (with lower strand bias) tend to have 
higher percentages of coding regions than those in the 
first group, with a P value 1.8x10"** based on the 
Wilcoxon rank sum test, as shown in Figure 2A and 
(ii) this tendency is more significant for bacteria living in 
complex environments, with P values ranging from 0.25 to 
2.8 X 10~', as shown in Figure 2B-F. One possible explan- 
ation is that there might be a selection pressure for 
bacteria living in nutrient-depleted environments to keep
their genomes as compact as possible (without losing 
genes), and having a more balanced genome is one way 
to achieve this goal (a more balanced genome seems to 
allow a higher degree of overlap between regulatory 
regions of operons). 

A model for interpreting the percentage of 
leading-strand genes 

Our main hypothesis is that the percentage of leading- 
strand genes in a genome reflects the relationship 
between the key functionalities and the living environment 
of an organism. To check this hypothesis, we have 
examined the population of genes in each functional 
category encoded in each genome to see whether some 
of them can be used to predict the percentage of 
leading-strand genes. 

We trained 10 times a neural network with 57 input 
nodes, one node for each of the 55 functional categories, 
one node for gene density and one node for the genome 
size; one hidden layer of 10 nodes and one output 
node, where gene density is calculated as the percentage

Figure 2. Boxplots of the percentage of coding region versus the percentage of leading strand genes in a genome. (A) For all bacteria (P value of
the Wilcoxon test: l.I x 10^**); (B) bacteria of specialized type with P value 0.22; (C) bacteria of host-associated type with P value 0.54; (D) 
bacteria of aquatic type with P value 0.065; (E) bacteria of terrestrial type with P value 0.0031 and (F) bacteria of multiple type with P value 
1.9 X lO"'. 



of non-coding region length against chromosome
length. We split the 725 genomes into three sets: 507
(70%) as the training set, 109 (15%) as the validation
set and 109 (15%) as the testing set. At the end of the
training, the neural network has the following average
performance results on the three data sets: MSE = 
0.0015 and R (Pearson correlation score) = 0.91 
between the desired and predicted values on the training 
set; MSE = 0.0023 and R = 0.85 on the validation set and 
MSE = 0.0021 and R = 0.87 on the testing set. Figure 3 
shows the performance of a trained neural network on the 
different data sets. 

Using the variable selection procedure outlined in 
'Materials and Methods' section, we have examined the 
IV of each input on the performance of each neural 
network trained by increasing or decreasing its value by 
10% and used the averaged IV (MIV) for the 10 trained 
neural networks as a measure of the importance level of
that variable. We have examined the performance of
neural networks with smaller numbers of inputs with top 
MIV values: 30, 25, 20, 15, 10 and 5, as shown in Figure 4. 
Each network was trained three times. The networks 
with 25 inputs work best on the different data sets 
with the following average performance: MSE = 0.0018
and R = 0.89 on the training set; MSE = 0.0019 and
R = 0.86 on the validation set and MSE = 0.0017 and
R = 0.88 on the testing set. These 25 inputs
are listed in Table 3 and include cell cycle, ion transport,
transport, response to stress, nucleobase, cellular homeostasis,
translation and generation of precursor metabolites and energy
under the biological process category; ribosome, cytoplasm,
cell envelope and protein complex under the
cellular component category; RNA binding, electron
carrier activity, kinase activity, translation factor activity
and structural molecule activity under the molecular
function category; along with genome size and gene

Figure 3. Performance in predicting the percentage of leading-strand genes in a genome by our trained neural network on the training, validation and testing set, respectively.



density. This prediction result is highly consistent with our 
above results shown in Figure 3. 

Using the selected 25 variables from each genome, we 
have constructed a new neural network with one hidden 
layer to predict the overall percentage of genes on the 
leading strand as follows: 

$$P = \sum_{j=1}^{k_2} w_j^{(2)} f\left(\sum_{i=1}^{k_1} w_{ij}^{(1)} p_i\right), \quad p_i = \frac{x_i}{x_{i,\max}}, \quad k_1 = 25, \quad k_2 = 10$$

where P is the percentage of leading-strand genes in a
genome, f is a hyperbolic tangent sigmoid transfer
function, $w_{ij}^{(1)}$ is the weight of the ith input to the jth
node of the hidden layer and $w_j^{(2)}$ is the weight of the jth
node in the hidden layer to the output node in the neural
network model, $k_1$ is the number of variables, $k_2$ is the
number of nodes of the hidden layer and $p_i$ is a scaling factor
calculated as the ratio between the variable ($x_i$) and the
max value ($x_{i,\max}$) of this variable across all 725 bacterial
genomes.
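The forward pass of such a one-hidden-layer network can be sketched in plain Python. The sketch follows the symbol definitions given in the text (inputs scaled by their maximum, hyperbolic tangent hidden nodes, weighted sum at the output). It omits bias terms because the text does not mention them, so it is an illustration of the model's structure rather than a reimplementation of the trained network. The function names and the toy weights in the usage note are hypothetical.

```python
import math


def scale_inputs(x, x_max):
    """p_i = x_i / x_i,max, with x_max the per-variable maximum across genomes."""
    return [xi / mi for xi, mi in zip(x, x_max)]


def predict_leading_fraction(p, w1, w2) -> float:
    """P = sum_j w2[j] * tanh(sum_i w1[i][j] * p[i]).

    w1[i][j]: weight from input i to hidden node j.
    w2[j]:    weight from hidden node j to the output node.
    Bias terms are omitted (assumption: the text does not describe them).
    """
    k1, k2 = len(p), len(w2)
    hidden = [
        math.tanh(sum(w1[i][j] * p[i] for i in range(k1)))
        for j in range(k2)
    ]
    return sum(w2[j] * hidden[j] for j in range(k2))
```

With two toy inputs `scale_inputs([2.0, 5.0], [4.0, 10.0])`, weights `w1 = [[1.0, 0.0], [1.0, 0.0]]` and `w2 = [0.5, 0.3]`, the prediction equals `0.5 * math.tanh(1.0)`. In the model described above, `p` would have 25 entries and `w2` would have 10.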

We speculate that the genes of certain functional 
categories need to be on the leading strands when living 



in certain environments to out-compete their competitors 
when food is limited and the competition is high; other- 
wise, the organism may keep the genes on the lagging 
strands as a more balanced genome may mean a more 
compact genome, which requires lower maintenance 
energy. A good example is that the chemotactic response 
of Pseudoalteromonas haloplanktis in exploiting ephemeral 
microscale nutrient patches is at least 10 times faster than 
that of E. coli (35), suggesting that P. haloplanktis may be
genetically optimized for this particular capability. To
check whether some genes are specifically located on the
leading strand of the organism, we examined the strand
distribution of genes across the 55 GOslim functional
categories in P. haloplanktis and E. coli, and we found
that genes of some functional categories are significantly
more enriched (with P value < 0.05) on the leading strand of
P. haloplanktis than on that of E. coli, including protein
transport, DNA metabolic process, ion transport and
signal transduction under the biological process
category; organelle and intracellular under the cellular
component category; and antioxidant activity and motor
activity under the molecular function category
(Supplementary Table S3). This clearly makes sense as

Figure 4. Evaluation of performance in predicting the percentage of leading-strand genes in a genome with smaller numbers of inputs by our trained neural
network.



Table 3. Twenty-five selected inputs used in the neural network model

Category  Variable                                                   MIV
BP        GO:0007049 cell cycle                                      0.012344
BP        GO:0006811 ion transport                                   0.005643
BP        GO:0006629 lipid metabolic process                         -0.00275
BP        GO:0019748 secondary metabolic process                     -0.00294
BP        GO:0006810 transport                                       -0.00332
BP        GO:0006950 response to stress                              -0.00388
BP        GO:0006139 nucleobase                                      -0.00402
BP        GO:0019725 cellular homeostasis                            -0.00466
BP        GO:0006412 translation                                     -0.00825
BP        GO:0006091 generation of precursor metabolites and energy  -0.01027
CC        GO:0005840 ribosome                                        0.009582
CC        GO:0005737 cytoplasm                                       0.007178
CC        GO:0005622 intracellular                                   0.003747
CC        GO:0043226 organelle                                       0.002657
CC        GO:0030312 external encapsulating structure                -0.00234
CC        GO:0030313 cell envelope                                   -0.00319
CC        GO:0043234 protein complex                                 -0.0049
MF        GO:0003723 RNA binding                                     0.027956
MF        GO:0009055 electron carrier activity                       0.003654
MF        GO:0016301 kinase activity                                 0.002421
MF        GO:0030234 enzyme regulator activity                       -0.00182
MF        GO:0008135 translation factor activity                     -0.00304
MF        GO:0005198 structural molecule activity                    -0.01745
OT        Gene density                                               -0.00451
OT        Genome size                                                -0.00515

Biological process (BP), cellular component (CC) and molecular 
function (MF) in the first column are the top-level categories in the 
gene ontology (GO) hierarchy; OT is for other variables that are not 
GO categories. 



DISCUSSION 

It has been observed that bacterial genomes have a large 
variation in terms of the percentage of their leading-strand 
genes, ranging from ~45% to ~90%. We have provided 
an explanation for the large variation of observed strand 
biases across 725 bacterial genomes, which extends substantially
the previous explanations. Our key findings
in this study are that (i) genes of
certain functional categories need to be on the
leading strands of genomes to enhance the survivability
of the host; (ii) genes of some functional categories such as
transcription factor have higher preference to be on the 
lagging strands; (iii) there is at least one balancing force 
that keeps genes from all moving to the more efficient 
leading strands during evolution, particularly in 
nutrient-depleted environments and (iv) the percentage 
of leading-strand genes for a bacterial genome can 
be well explained using the numbers of genes in 25 functional
categories outlined in (i) and (ii), genome size and
gene density. We anticipate that more sophisticated 
analyses could possibly lead to quantitative models 
relating the percentage of leading-strand genes in a bac- 
terium to a few parameters, which reflect the relationships 
between the living environments of an organism and the 
'intended' capabilities of the organism and the needs for 
its survival, giving rise to improved understanding about 
the rules that may determine which genes will be on the
leading versus the lagging strand of a genome. 



collectively having more genes related to motor activity, 
transporter activity and signal transduction on the leading 
strand may enable the bacteria to react much faster when 
the nutrients become available (36,37). 



SUPPLEMENTARY DATA